Semi-supervised Data Clustering with Coupled Non-negative Matrix Factorization: Sub-category Discovery of Noun Phrases in NELL’s Knowledge Base

Authors

  • Chunlei Liu
  • Tom Mitchell
Abstract

The standard non-negative matrix factorization (NMF) is a popular method to obtain a low-rank approximation of a non-negative matrix, and it is also powerful for clustering and classification in machine learning. In NMF each data sample is represented by a vector of features of the same dimension. In practice, we often have good side information for a subset of data samples. This side information might be binary vectors that indicate human-provided class labels, or generic vectors in a different feature space. In this paper we propose the coupled non-negative matrix factorization (CNMF) method to automatically incorporate the side information of a subset of data. In CNMF, the matrix for data samples with or without side information in the original feature space and the matrix for data samples with side information in the new feature space are coupled together and iteratively optimized. Because the quality of the side information varies, a trade-off parameter is introduced to determine the importance of the side information, and we give a cross-validation method to choose its value. The time complexity of the CNMF method can be several times that of the original NMF method, but it remains of the same order. As an example of implementing the CNMF method, we look into the knowledge base of the CMU Never-Ending Language Learning (NELL) project and find sub-categories of noun phrases.
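The coupling described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact objective or update rules, so everything below is an assumption: the objective ||X - WH||^2 + lam * ||Y - W_L G||^2, the standard Lee-Seung style multiplicative updates, the convention that the first n_labeled rows of X are the samples with side information, and all names (coupled_nmf, lam, G) are hypothetical and not taken from the paper.

```python
import numpy as np

def coupled_nmf(X, Y, n_labeled, rank, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Sketch of a coupled NMF (assumed objective, not the paper's exact method).

    X: (n, d) non-negative data in the original feature space.
    Y: (n_labeled, d2) non-negative side information for the first
       n_labeled rows of X (an assumed ordering convention).
    lam: trade-off weight on the side-information term.
    Minimizes ||X - W H||_F^2 + lam * ||Y - W[:n_labeled] G||_F^2.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.random((n, rank)) + eps          # shared sample factors
    H = rng.random((rank, d)) + eps          # basis for original features
    G = rng.random((rank, Y.shape[1])) + eps  # basis for side-information features
    history = []
    for _ in range(n_iter):
        WL = W[:n_labeled]
        # Update each basis with W fixed (multiplicative updates keep entries >= 0).
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        G *= (WL.T @ Y) / (WL.T @ WL @ G + eps)
        # Update W: every row sees X; only labeled rows also see Y.
        num = X @ H.T
        den = W @ (H @ H.T)
        num[:n_labeled] += lam * (Y @ G.T)
        den[:n_labeled] += lam * (WL @ (G @ G.T))
        W *= num / (den + eps)
        loss = (np.linalg.norm(X - W @ H) ** 2
                + lam * np.linalg.norm(Y - W[:n_labeled] @ G) ** 2)
        history.append(loss)
    return W, H, G, history

# Toy usage: 30 samples, 8 features, one-hot labels for the first 10 samples.
rng = np.random.default_rng(1)
X_demo = rng.random((30, 8))
Y_demo = np.eye(3)[rng.integers(0, 3, size=10)]
W_demo, H_demo, G_demo, hist_demo = coupled_nmf(X_demo, Y_demo, n_labeled=10, rank=3)
clusters = W_demo.argmax(axis=1)  # one cluster index per sample
```

In this sketch a cluster is read off as the largest entry in each row of W. Setting lam to 0 reduces the W and H updates to plain NMF, and each iteration adds only the terms involving Y and G, which is consistent with the abstract's remark that the cost stays of the same order. The value of lam would be chosen by cross-validation on held-out labeled samples, as the abstract suggests; that step is not shown here.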

Similar resources

Which Noun Phrases Denote Which Concepts?

Resolving polysemy and synonymy is required for high-quality information extraction. We present ConceptResolver, a component for the Never-Ending Language Learner (NELL) (Carlson et al., 2010) that handles both phenomena by identifying the latent concepts that noun phrases refer to. ConceptResolver performs both word sense induction and synonym resolution on relations extracted from text using ...

Orthogonal Nonnegative Matrix Tri-factorization for Semi-supervised Document Co-clustering

Semi-supervised clustering is often viewed as using labeled data to aid the clustering process. However, existing algorithms fail to consider dual constraints between data points (e.g. documents) and features (e.g. words). To address this problem, in this paper, we propose a novel semi-supervised document co-clustering model OSS-NMF via orthogonal nonnegative matrix tri-factorization. Our model...

Extracting Prior Knowledge from Data Distribution to Migrate from Blind to Semi-Supervised Clustering

Although many studies have been conducted to improve clustering efficiency, most state-of-the-art schemes suffer from a lack of robustness and stability. This paper aims to propose an efficient approach to elicit prior knowledge in terms of must-link and cannot-link from the estimated distribution of raw data in order to convert a blind clustering problem into a semi-supervised o...

A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a common method in data mining that has been used in different applications as a dimension reduction, classification, or clustering method. Alternating least squares (ALS) methods are usually used to solve this non-convex minimization problem. At each step of an ALS algorithm, two convex least squares problems must be solved, which causes high com...

Nonnegative Matrix Factorizations for Clustering: A Survey

Recently there has been significant development in the use of non-negative matrix factorization (NMF) methods for various clustering tasks. NMF factorizes an input nonnegative matrix into two nonnegative matrices of lower rank. Although NMF can be used for conventional data analysis, the recent overwhelming interest in NMF is due to the newly discovered ability of NMF to solve challenging data ...

Publication date: 2013